DATA PROCESSING AND ANALYSIS IN PYTHON LANGUAGE

Name - Surname = Ozan Can Demir

Student Number = 402533

Department = Data Science and Business Analytics

Project = Personal Loan Analysis

Project Purpose And Objectives

The purpose of this project is to perform data analysis to extract meaningful insights and valuable information from the consumer dataset. I will use the Personal Loan Analysis dataset to analyze variables such as income, education, family, and experience, and the correlations between them. We will also show how these variables affect the consumer loan acceptance process and identify the profile of the customer who is most likely to accept a personal loan offer, based on the customer's relationship with the bank across the features given in the dataset.

Learning Outcomes

Import Packages

Dataset

The data used in this project was downloaded from the source: Data

The data contains information on 5000 customers. It includes customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer's response to the last personal loan campaign (Personal Loan).

Details of Data Set

Columns Details

Let's remove ID and Zipcode, since we will not be using them in our analysis:

Rename columns "CD Account" and "CreditCard":

After these changes, our dataset has 5000 observations and 12 variables.
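The two clean-up steps above can be sketched as follows. This is only an illustration on a tiny made-up table, assuming pandas; the new column names "CD_Account" and "Credit_Card" are hypothetical, since the names chosen in the original notebook are not shown here.

```python
import pandas as pd

# Tiny hypothetical stand-in for the loan dataset (the real file has 5000 rows).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [25, 45, 39],
    "Zipcode": [91107, 90089, 94720],
    "CD Account": [0, 0, 1],
    "CreditCard": [0, 1, 0],
})

# Drop the identifier columns that are not used in the analysis.
df = df.drop(columns=["ID", "Zipcode"])

# Rename the two columns; the target names are assumptions for illustration.
df = df.rename(columns={"CD Account": "CD_Account", "CreditCard": "Credit_Card"})

# Report the resulting shape (rows = observations, columns = variables).
n_rows, n_cols = df.shape
print(f"{n_rows} observations, {n_cols} variables")
```

Both `drop` and `rename` return a new DataFrame, so the result is assigned back to `df` instead of relying on in-place modification.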

Missing Value Check

Let's visualize it:

We have no missing values in our dataset, so we can move forward and start our data analysis.
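A missing value check of this kind might look like the sketch below. It assumes pandas and uses a small hypothetical table in which one value is deliberately missing, so that the check has something to report; in the real dataset every count would be zero.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one deliberately missing "Income" value.
df = pd.DataFrame({
    "Age": [25, 45, 39, 52],
    "Income": [49.0, np.nan, 11.0, 100.0],
})

# Count missing values per column, then in total.
missing_per_column = df.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```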

Descriptive Statistics of Data

Let's make a descriptive visualization of our dataset:

We can see the minimum, maximum, mean, and standard deviation for all key attributes of the dataset. "Income" has a lot of noise and is slightly right-skewed, while "Age" and "Experience" are evenly distributed.

Let's check skewness by numbers:
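One way to obtain skewness as numbers is pandas' `DataFrame.skew`. The sketch below uses a small hypothetical sample, not the real data: "Age" is symmetric, while "Income" has one large value that creates a right tail.

```python
import pandas as pd

# Hypothetical sample, chosen so the two columns behave differently.
df = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70],       # symmetric around 50
    "Income": [10, 20, 30, 40, 200],   # one large value creates a right tail
})

# Sample skewness per column: about 0 for symmetric data, > 0 for right skew.
skewness = df.skew()
print(skewness)
```

A value close to zero suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.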

Now we will visualize the skewness with the graphs below:

Summary of Observations about Dataset

Now, we will visualize "Experience" with a distplot to see its distribution. Since experience cannot be negative, if our Experience variable has any negative values, we will need to replace them with the mean of the variable for a better analysis.

Based on the graph, we can see that we have negative values in our data set under the variable "Experience". Now, we will deal with these negative values.

Let's first get our mean of "Experience":

We need to store these negative values in another dataframe:

Let's also check how many negative experience values our data has:

We have a total of 642 values in our dataset with negative experience. This is a really high number of outliers and can easily affect our analysis. For accuracy, we now need to replace these outliers with the mean of the Experience variable:

Now, we need to make sure that the negative values were actually replaced with the mean. To do so:

As we can see, there are no negative values in our Experience variable anymore. Now we can move forward to the correlation analysis.
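The steps above (compute the mean, store and count the negative rows, replace them, verify) can be sketched as follows. This is a minimal illustration on hypothetical values, assuming pandas; as in the text, the mean is taken over the whole column before cleaning, so it includes the negative entries.

```python
import pandas as pd

# Hypothetical sample with two invalid (negative) experience values.
df = pd.DataFrame({"Experience": [-1, -2, 5, 10, 20, 28]})

# Step 1: mean of the column, computed before any cleaning.
mean_experience = df["Experience"].mean()

# Step 2: keep the negative rows in a separate DataFrame and count them.
is_negative = df["Experience"] < 0
negative_rows = df[is_negative]
n_negative = len(negative_rows)

# Step 3: replace negative entries with the mean (the column becomes float).
df["Experience"] = df["Experience"].mask(is_negative, mean_experience)

# Step 4: verify that no negative values remain.
remaining_negative = int((df["Experience"] < 0).sum())
print(n_negative, "replaced;", remaining_negative, "remaining")
```

An alternative design would compute the mean only over the non-negative values, so that the invalid entries do not pull the replacement value down.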

In this section, we will plot each variable except "Education" to see its distribution in more detail.

Now, since the variable "Experience" is important in this data analysis, we will check the association of Experience with the other quantitative variables below:

From the results above, we can claim that the "Age" and "Experience" variables are highly associated with each other.
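An association check like this one can be expressed numerically with Pearson correlations. The sketch below assumes pandas and uses a hypothetical six-row sample in which "Experience" grows almost linearly with "Age".

```python
import pandas as pd

# Hypothetical sample; the real dataset has 5000 rows.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Experience": [1, 5, 10, 16, 20, 25],
    "Income": [40, 120, 60, 30, 150, 80],
})

# Pearson correlation of every other column with "Experience", strongest first.
experience_corr = (
    df.corr()["Experience"]
    .drop("Experience")
    .sort_values(ascending=False)
)
print(experience_corr)
```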

In order to observe the correlations between all variables at once, we will use a heatmap as below:

We can see that:

As stated above, the "Experience" and "Age" variables are highly correlated with each other. If we analyze "Age", we will need to drop the "Experience" variable from our dataset to avoid a multicollinearity issue:
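A possible way to make this decision explicit is to list every pair of columns whose absolute correlation exceeds a cutoff and then drop one column of the pair. The helper `highly_correlated_pairs` and the cutoff of 0.9 are hypothetical choices for this sketch, and the sample data is made up.

```python
from itertools import combinations

import pandas as pd


def highly_correlated_pairs(frame, threshold):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    return [
        (a, b)
        for a, b in combinations(frame.columns, 2)
        if abs(corr.loc[a, b]) > threshold
    ]


# Hypothetical sample in which Age and Experience move almost in lockstep.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Experience": [1, 5, 10, 16, 20, 25],
    "Income": [40, 120, 60, 30, 150, 80],
})

pairs = highly_correlated_pairs(df, threshold=0.9)  # 0.9 is an assumed cutoff
print(pairs)

# Keep "Age" and drop "Experience" to avoid multicollinearity.
reduced = df.drop(columns=["Experience"])
```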

As a second check, we will analyze the education level of the customers who may apply for a loan at the bank. First, what are the unique values in the "Education" column?

So we will map the three values 1, 2, and 3 to Undergraduate, Graduate, and Professional, respectively:
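The mapping could be written with `Series.map`, as in the sketch below. It assumes pandas and that the column is named "Education" and holds the integer codes 1, 2, and 3; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample of education codes.
df = pd.DataFrame({"Education": [1, 2, 3, 1, 2]})

# Inspect the distinct codes before mapping them.
unique_codes = sorted(df["Education"].unique().tolist())

# Map each numeric code to its label.
education_labels = {1: "Undergraduate", 2: "Graduate", 3: "Professional"}
df["Education"] = df["Education"].map(education_labels)

print(unique_codes)
print(df["Education"].tolist())
```

Note that `map` turns any code missing from the dictionary into NaN, so checking the unique values first helps catch unexpected codes.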

Next, we will analyze which education categories the bank's potential loan customers fall into. To do so, we group the data by education level:

Our findings from this pie chart:

1- Undergraduate: 41.9%

2- Graduate: 28.1%

3- Professional: 30%

In our data, undergraduate is the largest education group.
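The shares shown in a pie chart like this one can also be computed directly. The sketch below assumes pandas and uses ten hypothetical customers, so its percentages (40/30/30) are illustrative and are not the figures from the real data.

```python
import pandas as pd

# Ten hypothetical customers: 4 Undergraduate, 3 Graduate, 3 Professional.
df = pd.DataFrame({
    "Education": ["Undergraduate"] * 4 + ["Graduate"] * 3 + ["Professional"] * 3
})

# Share of each education group as a percentage of all customers.
shares = df["Education"].value_counts(normalize=True).mul(100).round(1)
largest_group = shares.idxmax()

print(shares)
print("Largest group:", largest_group)
```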

Let's also check how income is distributed between education groups:

From the plot above, we can say that the income of customers who took a personal loan is almost the same irrespective of their education. Now, we will visualize this in more detail:

As a result, customers who have taken a personal loan seem to have a higher income than those who have not. A plausible explanation is that people with a higher income are better able to make the monthly payments. Let's visualize the two groups separately below:
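The comparison described above can be backed by group averages. The following sketch assumes pandas and the column names "Education", "Personal Loan", and "Income"; the six rows are hypothetical and only mimic the pattern described in the text.

```python
import pandas as pd

# Hypothetical sample: one borrower and one non-borrower per education level.
df = pd.DataFrame({
    "Education": ["Undergraduate", "Undergraduate", "Graduate",
                  "Graduate", "Professional", "Professional"],
    "Personal Loan": [0, 1, 0, 1, 0, 1],
    "Income": [50, 150, 60, 140, 40, 160],
})

# Mean income for customers without (0) and with (1) a personal loan.
mean_by_loan = df.groupby("Personal Loan")["Income"].mean()

# Mean income per education level, with loan status spread across columns.
mean_by_edu_loan = (
    df.groupby(["Education", "Personal Loan"])["Income"].mean().unstack()
)

print(mean_by_loan)
print(mean_by_edu_loan)
```

In the second table, comparing values down a column shows whether income differs by education among borrowers, and comparing across a row shows the gap between borrowers and non-borrowers.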

RESULT

As a result, the importance of data analysis cannot be ignored. Companies are placing more and more importance on data analysis in order to optimize their processes and maximize their profit. Through data analysis, Amazon today can make strategic offers to its customers, and Amazon's sales have increased slightly.

Another example is that in 2018, the Houston Rockets, a National Basketball Association (NBA) team, raised their game using Big Data. The Rockets were one of four NBA teams to install a video tracking system which mined raw data from games. They analyzed video tracking data to investigate which plays provided the best opportunities for high scores, and discovered something surprising. Data analysis revealed that the shots that provide the best opportunities for high scores are two-point dunks from inside the two-point zone and three-point shots from outside the three-point line, not long-range two-point shots from inside it. This discovery entirely changed the way the team approached each game, increasing the number of three-point shots attempted. In the 2017-18 season, the Rockets made more three-point shots than any other team in NBA history, and this was a major reason they won more games than any of their rivals.

Python and its packages are becoming more and more popular around the world, and Python itself provides a lot of great functionality and flexibility for analyzing big datasets.

REFERENCES

https://www.oreilly.com/library/view/python-for-data/9781449323592/

https://learn.datacamp.com/

https://www.coursera.org/professional-certificates/ibm-data-science